Study finds bot detection software isn’t as accurate as it seems
In May 2022, bots allegedly played a part in holding up Elon Musk’s purchase of Twitter for $44 billion. Twitter claimed that 5% of its monetizable daily active users were automated accounts; Musk said the number was far higher. A public fight ensued, though some have suggested that Musk’s concern about bots was an excuse for him to withdraw from the deal.
That debate has been resolved, in a way — Musk purchased Twitter in October 2022 — but bots continue to present a wide range of challenges on social media, from the minor annoyance of spamming people to the potentially profound problems of spreading misinformation, influencing elections, and inflaming polarization.
And according to a recent study, existing third-party tools used to detect bots may not be as accurate as they seem. In their newly published paper, MIT researchers Chris Hays, Zachary Schutzman, Erin Walk, and Philipp Zimmer report that bot-detection models’ supposed high rates of accuracy are actually the result of a critical limitation in the data used to train them.
Identifying Twitter bots
A good deal of research focuses on developing tools that distinguish between humans and bots. Social media platforms have their own systems to identify and remove bot accounts, but those systems are generally kept secret. Companies might also have reason to misrepresent the prevalence of bots, the researchers note.
Third-party bot-detection tools use curated data sets and sophisticated machine learning models trained on those data sets to detect the subtle tells of a bot — to find patterns believed to be uniquely human or uniquely not. Those models are then deployed on social media to study bots at work. Hundreds, if not thousands, of papers have been published on identifying Twitter bots and understanding their influence.
“And that’s what we initially set out to do,” said Schutzman, a postdoctoral fellow at the MIT Institute for Data, Systems, and Society. “We wanted to study the spread of harmful information, and we needed to separate real people from bots.”
He and the other researchers downloaded a set of Twitter data from a repository hosted by Indiana University. They ran an off-the-shelf machine learning model and immediately got a 99% accuracy rate in sorting bots from people. “We assumed we’d done something wrong,” Schutzman said. The researchers figured bot detection was a complex problem, and finding success with such a simple tool was surprising. They tried another data set and got similar results.
Digging deeper, they found that each model accurately classified accounts from the data set it was trained on, but models trained to perform well on one data set did not necessarily perform much better than random guessing on a different data set. This suggests that a general-purpose bot-detection algorithm trained on such data sets may be highly error-prone when applied in real-world contexts.
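As a rough illustration of the pattern the researchers describe, the sketch below trains a standard classifier on one labeled bot data set and checks its accuracy both on held-out data from the same set and on a second, differently collected set. The file names, column names, and features here are assumptions for illustration, not the authors' actual data or pipeline; the point is only that in-domain accuracy can look excellent while cross-dataset accuracy falls toward random guessing.

```python
# Minimal sketch of the in-domain vs. cross-dataset experiment described above.
# Dataset files, column names, and features are illustrative assumptions,
# not the authors' actual data or pipeline.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

FEATURES = ["follower_count", "friend_count", "tweet_count", "likes_given"]  # hypothetical

def load(path):
    df = pd.read_csv(path)            # assumed layout: FEATURES plus an "is_bot" label (0/1)
    return df[FEATURES], df["is_bot"]

X_a, y_a = load("dataset_a.csv")      # one curated bot data set
X_b, y_b = load("dataset_b.csv")      # a second, differently collected data set

# Train an off-the-shelf model on data set A.
X_train, X_test, y_train, y_test = train_test_split(X_a, y_a, test_size=0.2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# In-domain accuracy tends to look excellent...
print("In-domain accuracy:", accuracy_score(y_test, model.predict(X_test)))

# ...while accuracy on a data set collected a different way can drop toward chance.
print("Cross-dataset accuracy:", accuracy_score(y_b, model.predict(X_b)))
```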
The researchers also found that for many of the data sets, a relatively simple model — such as looking at whether an account had ever “liked” any tweets — achieved accuracy similar to that of more complex machine learning models.
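A minimal sketch of such a one-feature baseline, under the same assumed data layout as above: classify any account that has never liked a tweet as a bot. The column names and the direction of the rule are assumptions for illustration, not the authors' exact rule.

```python
# Hypothetical one-feature baseline: an account that has never "liked" a tweet
# is predicted to be a bot. Column names are assumed, as in the sketch above.
import pandas as pd
from sklearn.metrics import accuracy_score

df = pd.read_csv("dataset_a.csv")                      # assumed columns: likes_given, is_bot
predicted_bot = (df["likes_given"] == 0).astype(int)   # 1 = bot, 0 = human
print("Single-feature accuracy:", accuracy_score(df["is_bot"], predicted_bot))
```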
“A systemic issue”
These results indicate that even complex models that are trained with this data and then used on Twitter aren’t necessarily accurately identifying Twitter bots. “This is a systemic issue in the Twitter bot collection domain,” Schutzman said. “It likely results from data being collected in simplistic ways, often for more narrowly tailored research, and then being uncritically reused for more general bot-detection tasks.”
Training sets for bot detection are generally gathered from a particular sample of tweets, such as those that include hashtags like #COVID19 or #stopthesteal. Each account is then manually labeled as a bot or a human. (The quality of this labeling process has been questioned but is taken for granted in this research.) Researchers then use these hand-labeled data sets to train sophisticated models to recognize a bot or a person within a particular context — bots that are used to spam consumers, say, or to influence political partisans. From this training, the models are deployed on Twitter to detect the presence and activities of bots “in the wild.”
“Underpinning most of the bot-detection literature is this idea that bots live in different typologies,” Schutzman said. If you train a model on data related to bots scamming the stock market, then that model should be good at recognizing any kind of bot designed for financial scamming, or scamming generally. “What we show, though, is that these data sets don’t cover the space of bots nearly as well as people would hope,” he said. The accuracy of these models is highly dependent on the particular data on which they are trained rather than on fundamental differences between bots and humans.
A call for transparency
The researchers note that the problem of inaccurate bot detection is likely more severe on other social media platforms than on Twitter. Twitter data, which is mostly text, is relatively easy to work with. Limits on image and video analysis pose an additional obstacle for researchers scraping content and studying bots on platforms like Instagram or TikTok.
In a way, the solution to this problem is simple, according to the researchers. Companies like Twitter should be more transparent with their data; they should share what they know or suspect about bots on their platforms with those who want to research the problem. Researchers would then have a holistic view of bots and their behavior, Schutzman said, allowing for far more comprehensive and reliable modeling compared with using labels manually applied to data sets that are limited in their representation.
“Without that kind of transparency around data quality, we’re stuck in this loop where we do our best: We gather data by searching hashtags or scraping particular accounts, then we pay undergrads to label them, we make that data public, and we train a model to detect election misinformation bots using data gathered 10 years prior,” Schutzman said. “Without the platforms providing better access, we’ll always have this limit on what third-party researchers can accomplish.”